Add Internet Archive Fetching Pipeline #195
Conversation
TimidRobot
left a comment
Pull requests (PRs) can't be accepted until the instructions in the description are followed. Please edit the description and follow the instructions in the <!-- HTML comments -->.
Please run the static analysis tools.
Updated Internet Archive Fetching Pipeline
I have made the requested changes and run static analysis. @TimidRobot, the code is ready for review.
TimidRobot
left a comment
Please note that the jurisdiction of ported licenses is not necessarily representative of the country of a given work. It should not be used to indicate country.
@TimidRobot I have made the requested changes and the code is ready for review.
TimidRobot
left a comment
Please update to import ArchiveSession, configure it, and then use its search_items method.
You should be able to configure ArchiveSession in the same way that requests.Session is configured in `quantifying/scripts/1-fetch/github_fetch.py`, lines 92 to 105 in 9aa5a8f:
```python
def get_requests_session():
    max_retries = Retry(
        total=5,
        backoff_factor=10,
        status_forcelist=GITHUB_RETRY_STATUS_FORCELIST,
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=max_retries))
    headers = {"accept": "application/vnd.github+json"}
    if GH_TOKEN:
        headers["authorization"] = f"Bearer {GH_TOKEN}"
    session.headers.update(headers)
    return session
```
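For example, since ArchiveSession subclasses requests.Session, the same mount/headers pattern should carry over. This is only a minimal sketch, not the final implementation, and it assumes the STATUS_FORCELIST and USER_AGENT constants requested below:

```python
from internetarchive.session import ArchiveSession
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def get_archive_session():
    # Mirror the requests.Session setup in github_fetch.py above.
    max_retries = Retry(
        total=5,
        backoff_factor=10,
        status_forcelist=STATUS_FORCELIST,  # assumed shared constant (see below)
    )
    session = ArchiveSession()
    session.mount("https://", HTTPAdapter(max_retries=max_retries))
    session.headers.update({"User-Agent": USER_AGENT})  # assumed shared constant
    return session
```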
Please put STATUS_FORCELIST and USER_AGENT in shared.py. For example (lines 10 to 22 in fda007c):
```python
STATUS_FORCELIST = [
    408,  # Request Timeout
    422,  # Unprocessable Content (Validation failed, endpoint spammed, etc.)
    429,  # Too Many Requests
    500,  # Internal Server Error
    502,  # Bad Gateway
    503,  # Service Unavailable
    504,  # Gateway Timeout
]
USER_AGENT = (
    "QuantifyingTheCommons/1.0 "
    "(https://github.com/creativecommons/quantifying)"
)
```
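Once they live in shared.py, the Internet Archive script can reference them directly. Roughly (a sketch, assuming shared.py is importable the same way the other fetch scripts import it):

```python
import shared
from urllib3.util.retry import Retry

# Build the retry configuration and headers from the shared constants.
max_retries = Retry(
    total=5,
    backoff_factor=10,
    status_forcelist=shared.STATUS_FORCELIST,
)
headers = {"User-Agent": shared.USER_AGENT}
```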
It looks like this is a difficult data source as far as the lack of normalization is concerned. Please work towards reducing the UNKNOWNs. Also, I would output them as errors and not save them into the data (or combine them into a single item if they can't be resolved).
I apologize. Something came up and I was a bit occupied in the past few days. I am on this now.
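For instance, the per-record handling could look roughly like this. A sketch only: the variable names are illustrative, not this PR's actual ones, and LOGGER refers to the module's existing logger:

```python
def should_skip(license_label, language_label, raw_licenseurl, raw_language):
    # Treat unmapped values as errors and exclude them from the saved data.
    if license_label == "UNKNOWN" or language_label == "UNKNOWN":
        LOGGER.error(
            f"Unmapped value skipped: licenseurl={raw_licenseurl!r}, "
            f"language={raw_language!r}"
        )
        return True
    return False
```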
- Normalize localized license URLs (e.g. /deed.de → base URL)
- Add fallback resolution for ISO 639-2/B codes and locale tags in language normalization
- Skip counting or saving results with UNKNOWN license or language
- Log unmapped values as errors
- Track unmapped license URLs and languages for diagnostics

## Summary of Changes

### Code Updates

- **Removed port/jurisdiction counting**
- **Dropped license URL column on CSV output files**
- **Create session upon searching Internet Archive**
- **Improved `normalize_license()`**:
  - Strips localized `/deed.xx` and `/legalcode.xx` suffixes
  - Normalizes scheme, host, and path for consistent mapping
- **Enhanced `normalize_language()`**:
  - Adds fallback resolution for ISO 639-2/B codes (e.g. `"ger"` → `"German"`)
  - Normalizes languages using the `pycountry` library
- **Filtering Logic**:
  - Skips counting or saving results with `"UNKNOWN"` license or language
  - Logs unmapped values as `ERROR` for traceability
- **Diagnostics**:
  - Tracks `unmapped_licenseurl_counter` and `unmapped_language_counter`
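To illustrate the localized-suffix handling described above, a simplified sketch (not the exact code in this PR; the helper name is illustrative):

```python
import re
from urllib.parse import urlparse


def strip_localized_suffix(licenseurl):
    # Reduce a localized license URL to its base form, e.g.
    # ".../licenses/by/4.0/deed.de" -> ".../licenses/by/4.0/"
    parsed = urlparse(licenseurl.strip().lower())
    path = re.sub(r"/(deed|legalcode)(\.[a-z-]+)?/?$", "/", parsed.path)
    if not path.endswith("/"):
        path += "/"
    return f"https://{parsed.netloc}{path}"
```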
@TimidRobot I have made the requested changes and the code is ready for review.
@TimidRobot I see issue #226. I don't think it's straightforward for Internet Archive to use shared.py to get sessions, because it doesn't use Session() but rather ArchiveSession().
Isn't
Good thinking. Should I make the changes to shared.py, or would that be handled by another issue?
In this PR, please. |
All done @TimidRobot |
```python
unmapped_language_counter = Counter()

fields = ["licenseurl", "language"]
query = "creativecommons.org"
```
Please make the query more specific: `licenseurl:*creativecommons.org*`
```python
try:
    # Use search_items for simpler pagination management
    search = session.search_items(
        query,
        fields=fields,
        params={"rows": rows, "start": total_rows},
        request_kwargs={"timeout": 120},
    )

    # Convert to list to iterate over
    results = list(search)
    total_rows += len(results)
    break

except Exception as e:
    wait_time = 2**attempt
    LOGGER.warning(
        f"API request failed (Attempt {attempt+1}/{max_retries}). "
        f"Waiting {wait_time}s.Error: {e}"
        f"\n{traceback.format_exc()}"
    )
    sleep(wait_time)
else:
    raise shared.QuantifyingException(
        f"Failed to fetch data after {max_retries} attempts.", 1
    )
```
The session already has built-in retry with exponential backoff. You don't need to manually handle retries.
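With a session built around a Retry-configured HTTPAdapter (as in shared.get_session), the body above could shrink to roughly this sketch (names reuse the snippet above):

```python
# Retries and backoff are handled by the session's HTTPAdapter/Retry
# configuration, so no manual attempt loop or sleep() is needed.
search = session.search_items(
    query,
    fields=fields,
    params={"rows": rows, "start": total_rows},
    request_kwargs={"timeout": 120},
)
results = list(search)
total_rows += len(results)
```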
```python
session = shared.get_session(
    accept_header="application/json", session=ArchiveSession()
)
```
Please assign `session` in `main()` and pass it to the function.
```python
fields = ["licenseurl", "language"]
query = "creativecommons.org"
license_mapping = load_license_mapping()
```
Please define `license_mapping` in `main()` and pass it to the function.
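Taken together with the session comment above, main() could own both objects and hand them down. A rough sketch only; the exact signature of query_internet_archive in this PR may differ:

```python
def main():
    session = shared.get_session(
        accept_header="application/json", session=ArchiveSession()
    )
    license_mapping = load_license_mapping()
    # Create shared objects once here and pass them in, rather than
    # constructing them inside the fetch function.
    query_internet_archive(session, license_mapping)
```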
```python
pass

# --- Try Alias Map ---
ALIAS_MAP = {
```
Please move this constant towards the top of the file with the other constants.
| "multiple": "Multiple Languages", | ||
| "music": "Undetermined", | ||
| } | ||
| ALIAS_MAP = {normalize_key(k): v for k, v in ALIAS_MAP.items()} |
I recommend you assign a new variable here to differentiate it from the constant.
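For example (the lowercase name is just a suggestion):

```python
# A distinct lowercase name avoids confusion with the ALIAS_MAP constant.
alias_map = {normalize_key(k): v for k, v in ALIAS_MAP.items()}
```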
@jessbryte because of the complexity of the data source, there's still a good amount of work to do on this source. Please let me know if you'd like to:
@TimidRobot Let's merge with the feature branch and then create new issues that can be fixed; I can take up those issues later. Which feature branch would we use in this case?
TimidRobot
left a comment
Great work, thank you!
@TimidRobot, how would you recommend I proceed with further contributions to the Internet Archive branch? Do I continue from the main branch, or how?
Fixes
Description
Technical details
Core Logic
The query_internet_archive function is the core fetching mechanism:
Tests
Confirm the script handles malformed or missing license fields without crashing.
Checklist
Developer Certificate of Origin
For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."